An external validation of coding for childhood maltreatment in routinely collected primary and secondary care data

Validated methods of identifying childhood maltreatment (CM) in primary and secondary care data are needed. We aimed to create the first externally validated algorithm for identifying maltreatment using routinely collected healthcare data. Comprehensive code lists were created for use within GP and hospital admissions datasets in the SAIL Databank at Swansea University, working with safeguarding clinicians and academics. These code lists build on and refine those previously published to include an exhaustive set of codes. Sensitivity, specificity and positive predictive value of previously published lists and the new algorithm were estimated against a clinically assessed cohort of CM cases from a child protection service in a secondary care-based setting ('the gold standard'). We conducted sensitivity analyses to examine the utility of wider codes indicating Possible CM. Trends over time from 2004 to 2020 were calculated using Poisson regression modelling. Our algorithm outperformed previously published lists, identifying 43–72% of cases in primary care with a specificity ≥ 85%. Sensitivity of algorithms for identifying maltreatment in hospital admissions data was lower, identifying between 9 and 28% of cases with high specificity (> 96%). Manual searching of records for those cases identified by the external dataset but not recorded in primary care suggests that this code list is exhaustive. Exploration of missed cases shows that hospital admissions data is often focused on the injury being treated rather than recording the presence of maltreatment. The absence of child protection or social care codes in hospital admissions data poses a limitation for identifying maltreatment in admissions data. Linking across GP and hospital admissions maximises the number of cases of maltreatment that can be accurately identified. Incidence of maltreatment in primary care using these code lists has increased over time.
The updated algorithm has improved our ability to detect CM in routinely collected healthcare data. It is important to recognize the limitations of identifying maltreatment in individual healthcare datasets. The inclusion of child protection codes in primary care data makes this an important setting for identifying CM, whereas hospital admissions data is often focused on injuries with CM codes often absent. Implications and utility of algorithms for future research are discussed.

www.nature.com/scientificreports/ combining these two data sources. We will explore the reasons for missed and incorrectly identified cases in each healthcare setting. We also assess variation in CM rates over time.

Methods
Study design. This is a retrospective e-cohort study.

Ethical approval. Approval was granted on 4/12/2018 by the Swansea University Information Governance Review Panel (IGRP) (approval number 0809), an independent body consisting of a range of government, regulatory and professional agencies (British Medical Association (BMA), National Research Ethics Service (NRES), Involving People, NHS Wales Informatics Service and Public Health Wales (PHW) NHS Trust) and members of the public, which grants approval to studies conducted within the SAIL Databank. All methods were performed in accordance with the relevant guidelines and regulations and in line with the permissions granted under these ethical approvals. All data within the SAIL gateway are treated in accordance with the Data Protection Act 2018 and are compliant with the General Data Protection Regulation (GDPR).
Informed consent was not required as this study utilizes fully anonymised data in accordance with the GDPR.
Data source. We linked data on an individual level via the Adolescent Mental Health Data Platform (ADP), an international data platform that supports mental health research in children and young people (CYP). For our study, the ADP used datasets from the SAIL Databank, a repository of routinely collected health and education datasets for the population of Wales 22,23 . All data are treated in accordance with the Data Protection Act 2018.
The following datasets were linked at a patient level: Welsh Demographic Service (WDS); Welsh Index of Multiple Deprivation, containing deprivation scores for all lower super output areas in Wales; GP database (GPD), containing information for all GP interactions covering 79% of the Welsh population; and Patient Episode Database for Wales (PEDW), containing data for all NHS Wales hospital admissions.

Externally validated dataset. The Cardiff and Vale University Health Board Minimum dataset for CM (CVCM) was imported into the SAIL Databank via the split-file method for anonymisation. The CVCM dataset comprises 3622 clinical assessments pertaining to 3123 children for suspected CM and includes date of assessment, type of abuse suspected (i.e., physical, neglect, sexual), reason for suspicions, details, and confirmation of findings. Three quarters (75.8%, 2747/3622) of the assessments were conducted on the basis of suspicion of physical abuse, 17% were sexual abuse and around 4% were neglect. For the purpose of this study CM was examined overall with no further stratification by maltreatment type due to the number of cases available within each sub-category. Within the dataset, the outcomes of the clinical assessments were divided into three categories: confirmed maltreatment, possible maltreatment and no maltreatment. Of the 3123 children, 388 (12.4%) had been seen on more than one occasion and 2889/3123 (92.5%) were under 18 were selected as the baseline population. Data collection began at GP registration date plus six months if newly registered (to avoid misclassification due to retrospective recording at registration), except for the under-1s, who were followed from GP registration date or study onset, whichever was the later. Data collection ended on the date of GP de-registration, death, 18th birthday or study end, whichever was the sooner. Individuals could supply multiple data periods.
Two study cohorts were created for the purpose of this study: the first for the validation exercise and the second for exploring incidence and prevalence (Supplementary File 1 Fig. S1).
Validation cohort. For validation purposes the CVCM dataset was linked with routinely collected data in the SAIL databank following the criteria above at the level of the individual.
We required one assessment in the CVCM dataset per child. Cases were divided into confirmed CM and not confirmed CM (the latter encompassing no maltreatment and possible cases) based on clinician assessment. For the children seen on more than one occasion, a hierarchical rule system was therefore adopted, whereby we used the most recent assessment date of any case that had ever been assessed as 'confirmed CM', followed by the most recent assessment date for the 'no CM' cases.
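The hierarchical rule above can be sketched as follows. This is a minimal illustration, not the study's implementation: the tuple layout and the outcome labels ("confirmed", "possible", "none") are assumptions, and possible and no-maltreatment outcomes are treated jointly as 'not confirmed CM', as described in the text.

```python
from datetime import date

def select_index_assessment(assessments):
    """Return {child_id: (index_date, label)} with one assessment per child.

    `assessments` is an iterable of (child_id, assessment_date, outcome)
    tuples; the outcome labels are hypothetical names for this sketch.
    """
    best = {}
    for child_id, assessed_on, outcome in assessments:
        # Rank confirmed assessments above all others, then prefer later dates.
        rank = (outcome == "confirmed", assessed_on)
        if child_id not in best or rank > best[child_id]:
            best[child_id] = rank
    return {
        child_id: (assessed_on, "confirmed CM" if confirmed else "not confirmed CM")
        for child_id, (confirmed, assessed_on) in best.items()
    }

# Illustrative (made-up) assessments for two children.
example = [
    ("a", date(2010, 1, 1), "confirmed"),
    ("a", date(2012, 1, 1), "none"),
    ("b", date(2011, 5, 1), "possible"),
    ("b", date(2013, 5, 1), "none"),
]
index_assessments = select_index_assessment(example)
```

A child who was ever confirmed keeps the date of their latest confirmed assessment even if a later assessment found no maltreatment; all other children keep their latest assessment date.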
Only children within the CVCM dataset who were registered to a SAIL-supplying GP and supplied a minimum of 6 months' data including the index date were included for comparisons to the routine data.
Incidence and prevalence cohort. Individuals registered with a SAIL-supplying GP following the inclusion criteria above (independent of linkage to the CVCM dataset) were included in the incidence and prevalence analysis. Data collection for each year began on the 1st of January or the start of follow-up as defined above, whichever was the later, and ended on the 31st of December or the end of data collection, whichever was the sooner. Person time was calculated between the start and end dates for each year.
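The per-year person-time calculation can be sketched as below. The function name, the inclusive day count and the 365.25-day year are assumptions made for illustration; the source does not state how days were converted to years.

```python
from datetime import date

def person_years_in_year(follow_start, follow_end, year):
    """Person-years contributed in `year`, clipping follow-up to the calendar year."""
    start = max(follow_start, date(year, 1, 1))   # whichever is the later
    end = min(follow_end, date(year, 12, 31))     # whichever is the sooner
    if end < start:
        return 0.0  # no follow-up overlaps this year
    # Inclusive day count divided by an assumed 365.25-day year.
    return ((end - start).days + 1) / 365.25

# Illustrative follow-up from 1 July 2010 to 31 March 2012.
contribution_2010 = person_years_in_year(date(2010, 7, 1), date(2012, 3, 31), 2010)
contribution_2009 = person_years_in_year(date(2010, 7, 1), date(2012, 3, 31), 2009)
```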
Measures. Age and deprivation indices were collected based on the onset of data collection for each year. Individuals were stratified by sex, age group (< 1 y, 1-4 y, 5-9 y, 10-14 y, 15-17 y, to align with data from the National Public Health Service for Wales) and quintiles of deprivation.
Maltreatment was identified from GP data using primary care Read codes and from hospital admissions data using ICD-10 codes. The development of the code lists is described below. Code lists for primary care and hospital admissions were developed to map onto one another as closely as possible. However, the coding nomenclature in primary care is broader and encompasses codes not available in hospital admissions data (e.g., child protection codes).
Incidence and prevalence of CM over time was examined in primary care data and admissions data. Incidence was defined as no record in the previous 12 months. Prevalence was defined as any record of maltreatment within a given year, independent of any previous events 21,24.
Development of code lists to identify CM. In order to fully explore the coding of CM in individual healthcare settings two sets of code lists were developed: Read codes for use in GP data and ICD-10 codes for admissions data. While we mapped these two sets of code lists as closely to one another as possible, these distinct coding systems contain different levels of detail. GP data contains a broader coding nomenclature than admissions data, including communication from other healthcare settings and child protection agencies. The utility of each set of codes will be assessed both separately and in combination.
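The incidence and prevalence definitions given under Measures (incident: no record in the previous 12 months; prevalent: any record within a given year) can be sketched for one individual as follows. Treating 12 months as 365 days and the function names are assumptions for illustration only.

```python
from datetime import date

def incident_events(event_dates, washout_days=365):
    """Dates counted as incident: the first record, plus any record with no
    maltreatment record in the preceding washout window (assumed 365 days)."""
    incident = []
    previous = None
    for recorded_on in sorted(event_dates):
        if previous is None or (recorded_on - previous).days > washout_days:
            incident.append(recorded_on)
        previous = recorded_on  # compare against the most recent record of any kind
    return incident

def prevalent_years(event_dates):
    """Calendar years with at least one record, regardless of earlier events."""
    return sorted({recorded_on.year for recorded_on in event_dates})

# Illustrative (made-up) record dates for one child.
records = [date(2010, 1, 1), date(2010, 6, 1), date(2012, 1, 1)]
incident = incident_events(records)
prevalent = prevalent_years(records)
```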
Our code lists were developed from existing literature 15,16,19,20,25-28, from our own searches for codes that may indicate maltreatment, risk, or cause for concern, and finally from clinician judgement. In keeping with previous research, we tiered our code list into those strongly indicating CM/confirmed maltreatment, referred to here as 'Confirmed CM' codes, and those codes that may indicate possible or suspected maltreatment or potential vulnerability 16,20, referred to here as 'Possible CM' codes. For the Confirmed CM codes in primary care we further divided the list into prevalent codes and incident codes, with codes indicating historical maltreatment (e.g. history of child abuse) excluded from the incident list 18.
Additional sensitivity analysis was conducted to refine previous lists, testing sensitivity and specificity of different subgroups and using clinical input to determine whether these codes are appropriate (excluding, for example, codes such as 'parents on benefits' and 'self-neglect'; supplementary analysis available on request).
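A tiered code list of this kind might be applied to a record as in the sketch below. The dictionaries hold only a handful of codes quoted elsewhere in this paper, not the full published lists, and the exact-match rule for Read codes and prefix-match rule for ICD-10 codes are assumptions for illustration.

```python
# Tiny illustrative subsets; tiers follow how the paper describes each code.
GP_CODE_TIERS = {
    "13IM.": "confirmed",  # child on protection register
    "13IO.": "confirmed",  # child removed from protection register
    "13IB0": "possible",   # child in foster care
}
ICD10_PREFIX_TIERS = {
    "T74": "confirmed",   # maltreatment syndromes
    "Y06": "confirmed",   # neglect and abandonment
    "Y07": "confirmed",   # other maltreatment
    "Z638": "possible",   # other specified problems related to primary support group
}

def _strongest(tiers):
    """Confirmed outranks possible; None means no listed code was found."""
    if "confirmed" in tiers:
        return "confirmed"
    if "possible" in tiers:
        return "possible"
    return None

def classify_gp_record(read_codes):
    # Read codes are matched exactly (assumption).
    return _strongest({GP_CODE_TIERS.get(code) for code in read_codes})

def classify_admission(icd10_codes):
    tiers = set()
    for code in icd10_codes:
        normalised = code.replace(".", "").upper()  # accept "Z63.8" or "Z638"
        for prefix, tier in ICD10_PREFIX_TIERS.items():
            if normalised.startswith(prefix):
                tiers.add(tier)
    return _strongest(tiers)
```

Keeping the tier alongside each code makes it straightforward to rerun an analysis with or without the Possible CM codes.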

Confirmed CM. Within the category of Confirmed CM were terms that unequivocally stated the existence of maltreatment. This included maltreatment syndromes, history/victim of abuse, prostitution, genital mutilation, criminal neglect/abandonment of baby, and child protection categories. Child protection is a response to confirmed maltreatment that has already taken place. This is distinct from safeguarding, which refers to measures put in place to prevent harm. Child protection and the presence of maltreatment would be determined by a case conference and this information communicated to GPs, and as such should appear in primary care data. However, based on conversations with clinical colleagues, child protection information may not always be present in the CVCM dataset and is rarely available in hospital admissions data.
Possible CM. Within the category of 'Possible CM' are codes that may indicate risk and vulnerability of children that may co-occur with maltreatment. However, they are not sufficient to indicate maltreatment in isolation. These codes fell into six categories:
1. At risk/safeguarding codes: For example, 'at risk of abuse' or 'safeguarding example'. Safeguarding was distinguished from child protection codes, which are included in the Confirmed CM list (see above). A child or young person is a safeguarding concern when they are living in circumstances where there is a significant risk of abuse (physical, sexual, emotional or neglect). At-risk codes may not be specific enough to record an event of Confirmed CM; however, these codes have utility for identifying children at significant risk. This may have implications for long-term outcomes and future research with vulnerable children. There is no equivalent of these codes in hospital admissions data, so these were searched in GP data only.
2. Other social care: For example, 'referred to social worker' or 'in foster care'. Social care codes not specifically related to child protection may also be useful for identifying potentially vulnerable children. However, there are many reasons why a child may have contact with social care that may not be related to maltreatment (e.g., parent or child disability and mental or physical health problems of either the parent or the child). There is no equivalent of these codes in hospital admissions data, so these were searched in GP data only.
3. Family circumstances: This includes codes such as 'child abuse in family' or 'family member on protection register'. These codes are indicative of risk but do not necessarily indicate CM for the patient in question.
4. Alleged/suspected maltreatment: For example, 'suspected child abuse', 'alleged abuse'.
5. Rib/limb fractures: For example, 'multiple rib fractures'.
6. Assaults: This included general codes such as 'Assault' and more specific codes such as 'physical assault at home'.

We calculated sensitivity, specificity, positive predictive values (PPV), negative predictive values (NPV) and 95% confidence intervals, and explored reasons for false negatives and false positives. Prevalence of maltreatment was reported because this number affects the PPV and NPV. However, this was the prevalence only within the narrowly defined study population, which was defined by hospital evaluation protocols. It is not the prevalence within the general population of the hospital or community 29.
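The accuracy measures can be sketched from a 2x2 table as below. The source does not state which confidence interval method was used, so the Wilson score interval here is an assumption. The counts in the usage line are derived from the false-positive and false-negative figures reported in the Results for the Confirmed CM list in GP data (254 of 553 confirmed cases missed; 203 of 1652 non-confirmed cases flagged) and are used purely as an illustration.

```python
import math

def wilson_ci(successes, total, z=1.96):
    """Wilson score interval for a proportion (assumed method, 95% by default)."""
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

def diagnostic_accuracy(tp, fp, fn, tn):
    """Point estimates and intervals for sensitivity, specificity, PPV and NPV."""
    return {
        "sensitivity": (tp / (tp + fn), wilson_ci(tp, tp + fn)),
        "specificity": (tn / (tn + fp), wilson_ci(tn, tn + fp)),
        "ppv": (tp / (tp + fp), wilson_ci(tp, tp + fp)),
        "npv": (tn / (tn + fn), wilson_ci(tn, tn + fn)),
    }

# Counts derived from figures quoted in the Results (illustrative use only).
result = diagnostic_accuracy(tp=553 - 254, fp=203, fn=254, tn=1652 - 203)
```

Because PPV and NPV depend on the share of confirmed cases in the validation cohort, they would differ in a population with a different prevalence.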
Trends over time. To assess variation over time, we calculated change in annual incidence and 95% CIs for Confirmed CM and Possible CM rates per 1000 person-years at risk (PYAR) between 01.01.2004 and 10.10.2020 for children aged under 18 years. Poisson regression was undertaken to investigate the adjusted association between incidence and prevalence of maltreatment, and year of diagnosis, sex, age group and deprivation. The significance of the variables in the Poisson regression modelling was assessed using Wald tests. Confidence intervals (CIs) for rates were estimated using two-tailed mid-p exact CIs (assuming Poisson distribution) 30. Statistical analyses were conducted using SPSS statistical software (version 22).
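A rate per 1000 PYAR with a two-tailed mid-p exact Poisson interval can be sketched as below. The study used SPSS, so this standard-library version is an independent illustration; the bisection solver, bracket widths and function names are assumptions.

```python
import math

def _pmf(k, mu):
    if mu <= 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def _cdf(k, mu):
    return sum(_pmf(i, mu) for i in range(k + 1))

def upper_tail_midp(k, mu):
    """P(X > k) + 0.5 * P(X = k) for X ~ Poisson(mu)."""
    return 1.0 - _cdf(k, mu) + 0.5 * _pmf(k, mu)

def lower_tail_midp(k, mu):
    """P(X < k) + 0.5 * P(X = k) for X ~ Poisson(mu)."""
    return _cdf(k - 1, mu) + 0.5 * _pmf(k, mu)

def _solve(f, lo, hi, iterations=200):
    """Bisection on a monotone function with a sign change in [lo, hi]."""
    f_lo = f(lo)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2

def midp_poisson_ci(k, alpha=0.05):
    """Mid-p exact limits for the Poisson mean given an observed count k."""
    tail = alpha / 2
    lower = 0.0 if k == 0 else _solve(lambda mu: upper_tail_midp(k, mu) - tail, 0.0, float(k))
    upper = _solve(lambda mu: lower_tail_midp(k, mu) - tail, float(k), k + 10 * math.sqrt(k + 1) + 10)
    return lower, upper

def rate_per_1000(events, person_years, alpha=0.05):
    """Rate per 1000 person-years with mid-p limits scaled the same way."""
    lower, upper = midp_poisson_ci(events, alpha)
    scale = 1000 / person_years
    return events * scale, lower * scale, upper * scale
```

The mid-p interval counts only half the probability of the observed count in each tail, which makes it somewhat narrower than the conventional exact interval.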

Results
Previously published code lists from Chandan et al. 18 performed with excellent specificity (> 97%). However, sensitivity was low, ranging from 7.6% for incident code lists explored in the 12 months either side of an index date to 11.6% for prevalent code lists where records were searched at any time. While specificity for these lists is high, around 90% of cases would be missed.
Sensitivity is improved in the code lists from ref. 16, ranging from 51.0 to 64.4%. However, specificity is lower, ranging from 86.1 to 90.1%. This list included a wider range of maltreatment-related codes, including codes related to child protection procedures. While sensitivity is higher, this wider list admits a broader range of potentially indicative codes.
The highest performing previously published code list was that from ref. 20 (Table 1). To make this list comparable with the updated algorithm we ran three versions of this list through the validation exercise. Sensitivity ranged from 48.6 to 69.3% and specificity from 85.5 to 91.1%, depending on time restrictions and the subset of the list used. Inclusion of codes related to wider social care outside of child protection increased sensitivity with a negative impact on specificity. Utilising codes categorised as maltreatment and suspected maltreatment resulted in the highest sensitivity, ranging from 59.1 to 69.3%, with the lowest specificity of 85.7-89.1%.

Table 1. Sensitivity (95% CI), specificity (95% CI), PPV (95% CI) and NPV (95% CI) of previously published lists when run through the validation exercise against the gold standard. a PPV and NPV based on prevalence in validation population. b Codes present in GP record within 12 months before or after index date. c Codes present in GP record at any time. d Incident case refers to a new maltreatment event (either first in records or no record within previous 12 months). Codes related to historical maltreatment are excluded. e Prevalent case refers to an event independent of whether there has been a previous maltreatment date. All codes including those related to historical maltreatment are included. f Confirmed CM here refers to prevalence list with historical codes included.

Assault codes also have a large impact on the algorithm, increasing sensitivity to 48.6% (95% CI 44.4-52.9) in the 12 months either side of an index date and to 58.8% (95% CI 54.5-62.9) at any time. Inclusion of these codes has a small impact on specificity (12 months 91.2% [95% CI 89.7-92.5]; at any time 87.4% [95% CI 85.7-89.0]). The most frequently used 'assault' code was '[X]Assault'. Refining these codes to only include 'assaults occurring at home' did not impact the algorithm as they were rarely used.
When all additional codes are added ('at risk', 'family circumstances', 'other social care', 'suspected/alleged', 'rib/limb fractures', 'assaults') to the original Confirmed CM list, sensitivity is increased with a decrease in specificity.

False positives. Using the Confirmed CM code list (prevalence subset) we incorrectly identified 203 out of 1652 individuals (12.2%) who had been clinically assessed as not being maltreated. Of these, 199 had codes relating to child protection and 29 had confirmed maltreatment codes (either current or historical, e.g., 'Physical abuse'). The most used codes were '13IM. Child on protection register' (n = 142) and '13IO. Child removed from protection register' (n = 90).

False negatives. A total of 254 of the 553 clinically confirmed cases were not identified using our algorithm. Of these, 78 (30.7%) had at least one of the 'Possible CM' codes. The most frequently occurring of these codes was 'U3… [x] assault', followed by '13IB0 child in foster care'.
Of the remaining 175 false negatives the most used codes were administrative codes (e.g., 'notes summary on computer') or routine codes such as inoculations or paracetamol prescriptions. Other frequently used codes included 'letter from specialist', 'seen in paediatric clinic', and codes related to chest infections (e.g. 'chest infection NOS'). There were codes indicating hospital or emergency department attendance (e.g. 'seen in hospital casualty' or 'emergency hospital admission') without specific mention of maltreatment.

Table 2. Sensitivity (95% CI), specificity (95% CI), PPV (95% CI) and NPV (95% CI) of code lists for child maltreatment, including sensitivity analysis of the addition and exclusion of categories of indicative/at-risk codes in primary care. a PPV and NPV based on prevalence in validation population. b Codes present in GP record within 12 months before or after index date. c Codes present in GP record at any time. d Confirmed CM codes encompass maltreatment related syndromes, child protection codes, genital mutilation, and prostitution. e Prevalent case refers to an event independent of whether there has been a previous maltreatment date. All codes including those related to historical maltreatment are included. f Incident case refers to a new maltreatment event (either first in records or no record within previous 12 months). Codes related to historical maltreatment are excluded. g Confirmed CM here refers to the prevalence list with historical codes included. h 'Other social care' refers to social care codes that do not specify child protection, for example 'child in care' or 'social services involved'.
Validation exercise 2: Hospital admissions data. Results of the validation exercise in hospital admissions data are shown in Table 3. This includes sensitivity analysis accounting for the impact of adding or removing various groups of codes. Sensitivity of Confirmed CM codes was lower in hospital admissions data than in primary care ranging from 9.4 (95% CI 7.

False positives. Using the CM code list, we incorrectly identified 10 out of 1652 individuals (0.6%) who had been clinically assessed as not being maltreated. Of these, almost all (numbers masked for confidentiality) had a code for 'maltreatment syndromes' (T74) alongside codes for 'other maltreatment' (Y07) and 'neglect and abandonment' (Y06).

False negatives. A total of 501 of the 553 clinically confirmed cases were not identified using our algorithm. Of these, 235 were admitted to hospital within the 12 months either side of the index date.
Of these, 61 (26.0%) had at least one of the 'Possible CM' codes. The most commonly occurring of these were codes for Assaults (n = 61), followed by Z638 'Other specified problems related to primary support group' (excl. maltreatment syndromes, negative life events in childhood and upbringing; n = 28), and Fractures/dislocations/rib fractures (n = 14).
Of the remaining 174 false negatives the most used codes were for 'Injury, poisoning and other consequences of external causes'. The most used single codes were S00 'superficial injury of head', X59 'Exposure to unspecified factor' and K02 'dental caries'. It appears that coding in hospital admissions may be more focused on the injury in need of treatment than on recording the presence of maltreatment. Also of note is the absence of child protection and social care codes, which account for a large proportion of correctly identified cases in primary care.

Validation exercise 3: linking GP and admissions data. Combining GP and hospital admissions data to identify CM improved sensitivity slightly compared to using either dataset individually (Table 4). There was only a slight decrease in specificity. Similar results were seen when looking at the Possible CM codes (Table 4).
When looking at Confirmed CM, 82.1% of cases were identified in GP data only, 17.9% in admissions data only and 11.6% in both GP and admissions data (Fig. 1). When looking at the Possible CM codes the proportion identified in admissions data increased (GP only 62.4%; admissions only 34.4%; GP and admissions 37.6%).
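Combining the two sources amounts to taking the union of individuals flagged in either dataset, with the overlap giving the 'both' group. A minimal sketch follows, in which the identifiers are made up and the function names are hypothetical.

```python
def combine_sources(gp_ids, admission_ids):
    """Split flagged individuals by the source(s) in which they were identified."""
    gp_ids, admission_ids = set(gp_ids), set(admission_ids)
    return {
        "gp_only": gp_ids - admission_ids,
        "admissions_only": admission_ids - gp_ids,
        "both": gp_ids & admission_ids,
        "any": gp_ids | admission_ids,  # cases found after linkage
    }

def source_shares(groups):
    """Percentage of all identified cases falling in each exclusive group."""
    total = len(groups["any"])
    return {
        name: 100 * len(groups[name]) / total
        for name in ("gp_only", "admissions_only", "both")
    }

# Made-up identifiers for illustration.
groups = combine_sources(gp_ids={1, 2, 3, 4}, admission_ids={4, 5})
shares = source_shares(groups)
```

Defined this way, the three exclusive groups always sum to 100% of the linked cases.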
Incidence and prevalence over time. Incidence of Confirmed CM in both GP and admissions data was comparable between sexes. Incidence decreased with increasing age, with the highest incidence in those aged < 1 year (IRR GP 3.5 [95% CI 3.1-3.9]; admissions 5.6 [95% CI 4.2-7.5], with 15-17 years as the reference group). Incidence was highest in the most deprived quintiles, with more than five times the risk in GP data and six times the risk in admissions (IRR GP 5.4 [95% CI 5.0-5.9]; admissions 6.1 [95% CI 4.4-8.4]). Individuals with no deprivation data were also at increased risk (Tables 5 and 6).

Table 3. Sensitivity (95% CI), specificity (95% CI), PPV (95% CI) and NPV (95% CI) of code lists for child maltreatment included sensitivity analysis of additional and exclusion of categories of. a PPV and NPV based on prevalence in validation population. b Hospital admission within 12 months before or after index date. c Hospital admission at any time. d Confirmed CM codes encompass maltreatment related syndromes, genital mutilation and prostitution. Note that child protection codes are not available in hospital admissions data.

When exploring GP contacts for Possible CM, demographic indices broadly mirrored those seen for Confirmed CM, with little difference between sexes and a decreasing incidence with increasing age (IRR < 1 year 3.5 [95% CI 3.3-3.7], with 15-17 years as the reference group). An increase in incidence was seen with increasing deprivation; however, this was smaller for Possible CM than Confirmed CM, with over double the rate in the most deprived compared with the least deprived quintiles (IRR 2.6 [95% CI 2.6-2.8]). Those with unknown deprivation were also at increased risk.
Admissions related to Possible CM showed a different demographic pattern from Confirmed CM, with around double the admissions in males compared with females and an increase in incidence rate with increasing age (Table 6). Further exploratory analysis revealed that this was attributable to assaults and fractures, rates of which increase with increasing age.
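The IRRs reported in this section come from adjusted Poisson regression models. As a simpler illustration of what an IRR expresses, the sketch below computes a crude (unadjusted) rate ratio between two groups with a Wald interval on the log scale. The counts in the usage line are hypothetical and do not come from the study.

```python
import math

def crude_irr(events_group, pyar_group, events_reference, pyar_reference, z=1.96):
    """Unadjusted incidence rate ratio with a log-scale Wald interval."""
    irr = (events_group / pyar_group) / (events_reference / pyar_reference)
    # Standard error of log(IRR) for two independent Poisson counts.
    se_log = math.sqrt(1 / events_group + 1 / events_reference)
    return irr, irr * math.exp(-z * se_log), irr * math.exp(z * se_log)

# Hypothetical counts: 30 events vs 10 events over equal person-time.
irr, irr_low, irr_high = crude_irr(30, 1000, 10, 1000)
```

A crude ratio like this ignores confounding by year, sex, age and deprivation, which is why the study reports model-adjusted estimates instead.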
Incidence of GP events for Confirmed CM increased from 3.

Discussion
Main findings. This study demonstrates the creation and first external validation of codes and algorithms to identify cases of CM from routinely collected healthcare data. Sensitivity was higher than that identified by previously published CM code lists. We utilised the validated code lists and found an increase in the incidence and

Table 4. Sensitivity (95% CI), specificity (95% CI), PPV (95% CI) and NPV (95% CI) of code lists for child maltreatment, including sensitivity analysis of the addition and exclusion of categories of indicative/at-risk codes in hospital admission data. a PPV and NPV based on prevalence in validation population. b Hospital admission or GP record within 12 months before or after index date. c Hospital admission or GP record at any time. d Confirmed CM codes encompass maltreatment related syndromes, genital mutilation, and prostitution. Note that child protection codes are not available in hospital admissions data.

We linked a clinically assessed hospital-based CM cohort to cases identified in GP and hospital admissions records to assess the sensitivity, specificity, PPV and NPV. Using Confirmed CM codes the algorithm performed with high specificity, minimising the proportion of incorrectly identified cases, an important factor for most cohort and case-control studies. The difference between datasets in the ability to identify maltreatment was highlighted, with the majority of cases detected in primary care rather than hospital admissions. Of note, the proportion of cases identified exclusively in admissions data was higher for Possible CM codes than Confirmed CM. The true extent of CM is difficult to establish in routine data due to the complexity of the attendance, recognition, recording and coding of maltreatment.
Sensitivity analyses were conducted to encompass a broader range of codes that may indicate maltreatment or individuals who are vulnerable or at risk. This improves sensitivity with a small negative impact on specificity. However, the nature of these codes means that their use should be considered on a study-by-study basis. False negatives frequently had codes for 'other social services' (e.g., child in care) and codes for assaults. While these codes may indicate risk and potential maltreatment, this may not always be the case. For example, in the case of social services codes, children may be involved with social services for a number of reasons including child or parental illness/disability unrelated to maltreatment. These individuals may have many co-occurring risk factors and appear similar to those who have codes for maltreatment in large databases of routinely collected data. The care, support and resources needed will be unique to circumstances, and grouping these individuals together for research may not be appropriate. Similarly, codes for assault may not always indicate maltreatment. Consideration could be given to applying age constraints to these codes dependent on the study (e.g., assault codes only for those aged under 5 or under 10 years). The 'Possible CM' codes are comprehensive, but cannot be reliably used for case ascertainment without further evidence to substantiate them. Further work using data learning techniques may prove fruitful in improving performance by combining codes, for instance code terms such as 'maternal concern' along with 'emergency admissions to hospital'. While some iterations of previously published code lists performed with high sensitivity, care must be taken over the inclusion of codes that do not indicate maltreatment (e.g., 'parental benefits' or 'self-neglect').
Many of these codes may indicate a potentially vulnerable child with similar risk factors and co-morbidities, however these represent distinct groups of individuals who require different types and levels of support. The importance of care in selecting and validating codes for maltreatment is emphasized.
There was an increase in both incidence and prevalence of Confirmed CM and Possible CM codes from 2004 to 2019 in primary care, with the largest increase seen in Possible CM codes which more than doubled over time. This could reflect a genuine increase, or an increase in GP coding and recognition of vulnerable children. Rates of both CM and Possible CM codes were comparable between sexes, decreased with increasing age and were highest in the most deprived areas. This was most notable for Confirmed CM codes with more than five times the incidence in the most deprived compared with the least deprived areas.
Rates of hospital admissions for Confirmed CM remained at a low rate over time, with an initial increase in Possible CM admissions and a decrease from 2009 onwards.
False positives. Most false positives in GP records were identified from child protection codes. However, without these codes sensitivity is poor, picking up only around one fifth of cases. Child protection is the social services response to harm to a child and it seems reasonable to include these in any list of codes examining maltreatment. These CYP had the same Confirmed CM codes recorded in their GP records as the confirmed cases; given this, it would be difficult to further improve specificity. These children may be at high risk of maltreatment, as possible CM is suggested within their medical records; however, insufficient evidence of CM was found on the day of assessment.
False negatives. Around half of the clinically confirmed cases were missed using the Confirmed CM code list in GP records. All children within the CVCM cohort would have been assessed because they were considered 'at risk' of or a 'victim of' CM; therefore all these children would be more likely to have maltreatment-related codes recorded in their medical records than would perhaps be the case for a random sample. We found that 30.7% of these cases did have at least one Possible CM code recorded in their primary care records. Future application of machine learning techniques using the Possible CM code lists may identify combinations of codes that improve sensitivity, with minimal detriment to specificity, to tease apart 'confirmed' from 'possible' cases and to optimise performance.
Around 90% of cases were missed using admissions data. Of those missed around half were not admitted to hospital in the 12 months either side of their maltreatment assessment date and as such cannot be captured by this data. The narrower coding framework used in hospital admissions also limits recording of maltreatment with child protection and social care codes absent. Coding in hospital admissions is focused on the injury being treated and not on whether this was the result of maltreatment. While specificity was high the under-reporting of CM in admissions data must be acknowledged in any future studies.
Comparison to the literature. Incidence rates fall between those reported in two similar studies using routinely collected primary care data conducted in the UK 18,19. These differences are likely attributable to the choice of codes employed to identify cases of maltreatment, which illustrates the need for standardisation in definitions and subsequent validation. Rates were higher than those reported by Chandan et al. 18, most likely due to the addition of an extensive list of child protection codes in our algorithm. Rates using the Confirmed CM list in the current study are lower than those reported by Woodman et al. 19; however, they included a wider set of codes, such as 'out of home care' and 'social care' codes, which we excluded from our Confirmed CM code list but which were present in our Possible CM code lists (incidence using this list was comparable to that found by Woodman et al. 19). The findings from all three studies are in agreement that the incidence rates recorded in primary care underestimate the true rates present within the community, although incidence rates for CM reported in the current study are in keeping with ONS data on the number of children in Wales subject to a child protection plan 7.
The increase in CM over time as recorded in primary care is supported by previous research with routinely collected GP data 18,19. The factors driving these increases in recording in primary care are less clear. They may reflect raised awareness and real improvements in recognition and response, together with changes in coding and reporting behaviour to record all concerns of CM 19. There have also been increases in child protection activity in recent years, but it is unclear whether this is because child protection services have become better at recognising and responding to maltreatment. An observational time-series study using official government agency and NSPCC data in England and Wales found that the incidence of crimes against children, child protection registrations and children entering care had increased steeply between 2000 and 2016 31. It is difficult to know whether this is part of a trend of increasing reporting, as opposed to rising levels of maltreatment within society. Further time series studies using national survey data may be needed to establish whether CM is becoming more common.
We found a decrease in recorded CM in primary care in 2020. This is in keeping with concerns that cases of abuse may have been missed due to restricted access to protective services during lockdowns and disruptions to usual safeguarding pathways 32,33. There are reports of reduced contact with public sector organisations such as schools, hospitals and emergency services in the UK 34 and reductions in children added to the child protection register in 2020 35. Data from one county borough demonstrated the largest decrease in referrals in the youngest children (aged < 3 years) 35. This is alongside reports of increased contact with child abuse helplines 36. Therefore, there may have been a disparity between incidence in public services and incidence in the community.
Rates of CM in primary care were highest in those aged less than a year old, with older adolescents having the lowest rates. Increased GP awareness of maltreatment in younger children, particularly through health worker surveillance, and lower consultation rates for older children may be responsible for these differences 19. Younger children are more likely to come to the attention of children's services, particularly those under 1 year old 7. Fewer adolescents are placed on child protection registers than any other age group. This age group may be more at risk of maltreatment through lack of identification and protection measures 31. Admissions for Possible CM were highest in 15- to 17-year-olds, largely attributable to the higher rates of assaults and fractures/dislocations in older age groups. Further research is needed to explore the nature of admissions for assaults and fractures/dislocations in older age groups and whether these presentations may represent an opportunity to identify and support adolescents at risk of maltreatment.
Incidence of Confirmed CM was more than five times as high in the most deprived compared with the least deprived communities in primary care and more than six times as high in hospital admissions. Individuals with no deprivation data were also at increased risk compared with the least deprived quintiles. This finding has been reported in other studies 18,19. The relationship between family poverty and the likelihood of a child experiencing maltreatment is already well established 13,37,38.
Strengths and limitations. This study utilised a large population level database and the creation of comprehensive code lists for CM. Extensive manual searching was conducted alongside analysis of missed cases. It appears that the codes identified are exhaustive and that sensitivity could not be improved by adding additional codes. This underscores the importance of understanding that healthcare records underestimate the true incidence of maltreatment in the community.
This study further highlights the strengths and limitations of the individual healthcare datasets, their utility in detecting CM, and the use of these datasets in combination for the study of CM. The differences in coding systems mean that comparison of rates between healthcare settings may not be appropriate. Further research is needed to assess comparability with settings outside of the UK.
Future research may also explore combinations of codes, or machine learning applied to patterns of healthcare utilisation, to better identify CM in healthcare data.
Routinely collected data have limitations for research purposes, and the quality and completeness of data vary across datasets. We have attempted to minimise the impact of this by only including GPs that meet standards for data quality and validating study code lists. CM not resulting in presentation to services or where CM is discussed but not recorded will not be captured here. This is a common feature of all studies using routine data. These data are a reflection of contacts with the healthcare system, not rates of CM in the community.
There is selection bias in the externally validated CVCM dataset, as it represents a cohort of suspected victims of, or those at risk of, CM. All these CYP are therefore more likely to have a code suggestive of maltreatment recorded within their medical records. This makes improving performance of the algorithm more challenging, as we are effectively attempting to distinguish between 'confirmed' and 'possible' maltreatment cases.
Implications. The validation of codes and development of algorithms from routinely collected datasets that identify cases with high specificity are an important step in epidemiological research. These validated code lists will be applicable to other datasets of routinely collected data, and the choice of algorithm will vary with study design. This standardisation is important for research purposes to better understand the true effect and consequences of CM. Around half of the cases missed in admissions data were not admitted to hospital and as such could not be picked up in admissions data. Records of maltreatment appeared much more frequently in GP data. This is likely due to a combination of the higher number of contacts with primary care, the recording of communication between GPs and hospital settings, and the extensive coding nomenclature in GP settings, in particular the presence of child protection and social care codes. This makes GPs better placed to record CM, and it is important to include this setting where possible to identify cases. Where CM is being explored in admissions data, the limitations must be recognised: around 90% of cases are likely to be missed, although specificity in this setting is high.
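The benefit of combining settings can be illustrated by flagging an individual if either source holds a qualifying code. The sketch below is an assumption-laden illustration, not the study's linkage method: the function name, the identifiers, the dates and the use of a 365-day window on both sources are all hypothetical.

```python
from datetime import date, timedelta

# Assumed window: roughly 12 months either side of the assessment date.
WINDOW = timedelta(days=365)


def flagged_in_window(events, assessment_dates, window=WINDOW):
    """Return IDs with at least one coded event within the window of their assessment date."""
    flagged = set()
    for person_id, event_date in events:
        assessed = assessment_dates.get(person_id)
        if assessed is not None and abs(event_date - assessed) <= window:
            flagged.add(person_id)
    return flagged


# Hypothetical cohort of three confirmed cases; NOT study data.
assessment_dates = {
    "A": date(2015, 6, 1),
    "B": date(2015, 6, 1),
    "C": date(2015, 6, 1),
}
gp_events = [("A", date(2015, 3, 10)), ("B", date(2012, 1, 5))]  # B falls outside the window
admission_events = [("C", date(2015, 7, 20))]

gp_flagged = flagged_in_window(gp_events, assessment_dates)
admission_flagged = flagged_in_window(admission_events, assessment_dates)
linked_flagged = gp_flagged | admission_flagged  # flagged in either setting

for label, ids in [("GP", gp_flagged), ("admissions", admission_flagged), ("linked", linked_flagged)]:
    print(f"{label}: {len(ids)}/{len(assessment_dates)} confirmed cases identified")
```

Because the linked set is a union, it can never identify fewer confirmed cases than either source alone, although in a real analysis any false positives from each source would also accumulate.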
We add to a body of evidence that CM recorded in primary care data has been increasing and further demonstrate a decrease in recorded CM in 2020. The long-term consequences of this drop in recording of maltreatment during the pandemic and the potential disparity with community rates are as yet unknown. This has significance for informing future policy surrounding protective public services. Future research should seek to explore this, and additional support should be considered for vulnerable children who may not have been identified during the pandemic.
Individuals in more deprived areas were at markedly increased risk of maltreatment. Also of note is the increased risk of those where deprivation data is unknown. This may indicate unstable living arrangements. Higher rates of maltreatment in these individuals may indicate the need for additional support or service provision. Further research is required to explore how best to support the most deprived communities or individuals where living arrangements may be unstable.
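A comparison of incidence between deprivation groups can be expressed as a crude incidence rate ratio. The sketch below uses the standard log-based Wald interval for a ratio of two rates; it is not the Poisson regression model used in the study, it makes no adjustment for other factors, and the function names, event counts and person-years are hypothetical.

```python
import math


def incidence_rate(events: int, person_years: float, per: float = 1000.0) -> float:
    """Crude incidence rate per `per` person-years."""
    return events / person_years * per


def rate_ratio_ci(events_a: int, py_a: float, events_b: int, py_b: float, z: float = 1.96):
    """Crude rate ratio (group A vs group B) with a log-based Wald interval."""
    irr = (events_a / py_a) / (events_b / py_b)
    se_log = math.sqrt(1 / events_a + 1 / events_b)  # standard error of log(IRR)
    lower = math.exp(math.log(irr) - z * se_log)
    upper = math.exp(math.log(irr) + z * se_log)
    return irr, lower, upper


# Hypothetical counts for illustration only; NOT study data.
most_deprived = {"events": 500, "person_years": 100_000}
least_deprived = {"events": 100, "person_years": 100_000}

irr, lower, upper = rate_ratio_ci(
    most_deprived["events"], most_deprived["person_years"],
    least_deprived["events"], least_deprived["person_years"],
)
print(f"rate per 1000 person-years (most deprived): {incidence_rate(**most_deprived):.1f}")
print(f"rate per 1000 person-years (least deprived): {incidence_rate(**least_deprived):.1f}")
print(f"crude IRR = {irr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

The same calculation could be repeated for the group with missing deprivation data, treating it as a separate category rather than discarding it.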

Conclusions
Through the validation and assessment of CM-related codes in healthcare records, we create a platform for future epidemiological research. Time-series analysis on CM population-based epidemiological surveys may be needed to establish whether increasing recognition of cases represents rising trends within the community or whether it is simply due to improvements in recognition and responses or a combination of both. Additional support should be considered for individuals from deprived communities and those who may not have been identified as vulnerable during the pandemic.

Data availability
The data used in this study are available in the SAIL Databank at Swansea University (Swansea, UK) via the Adolescent Mental Health Data Platform, but as restrictions apply they are not publicly available. All proposals to use SAIL data are subject to review by an independent Information Governance Review Panel (IGRP). Before any data can be accessed, approval must be given by the IGRP. The IGRP gives careful consideration to each project to ensure proper and appropriate use of SAIL data. When access has been granted, it is gained through a privacy protecting safe haven and remote access system referred to as the SAIL Gateway. SAIL has established an application process to be followed by anyone who would like to access data via SAIL at https://saildatabank.com/data/apply-to-work-with-the-data/.